perf: single-format rkyv serialization + zero-copy in-place verifier #769

Open
Oppen wants to merge 36 commits into pr/vkey from perf/rkyv-serialization

Oppen (Collaborator) commented Jul 3, 2026

Replaces postcard and serde with rkyv as the sole proof serialization format, and rewrites the STARK verifier as a single implementation over &ArchivedStarkProof, read in place from the archive — no deserialization pass, no per-field allocation. On little-endian, an archived field element is bit-identical to a native one (ArchivedFieldElement transparent newtype + slice_as_native), so the recursion guest verifies the proof straight out of its private-input buffer.
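The in-place read described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the `FieldElement` and `ArchivedFieldElement` types below are hypothetical one-word stand-ins, and the signature of `slice_as_native` is assumed. The real types carry Montgomery-form values and rkyv's own archived integer type, which this sketch ignores.

```rust
/// Hypothetical stand-in for a native field element: one machine word.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
pub struct FieldElement(pub u64);

/// Hypothetical stand-in for the archived form: the same word, always stored
/// in little-endian byte order.
#[derive(Debug, Clone, Copy)]
#[repr(transparent)]
pub struct ArchivedFieldElement(u64);

impl ArchivedFieldElement {
    /// Encode a native element into its archived (little-endian) form.
    pub fn from_native(x: FieldElement) -> Self {
        Self(x.0.to_le())
    }

    /// Decode by value; works on any endianness.
    pub fn to_native(self) -> FieldElement {
        FieldElement(u64::from_le(self.0))
    }
}

/// Reinterpret an archived slice as native elements without copying.
/// Only defined on little-endian targets, where the two layouts coincide.
#[cfg(target_endian = "little")]
pub fn slice_as_native(archived: &[ArchivedFieldElement]) -> &[FieldElement] {
    // SAFETY: both types are #[repr(transparent)] over u64, so they share
    // size, alignment and valid bit patterns; on little-endian the stored
    // bytes already are the native representation.
    unsafe {
        core::slice::from_raw_parts(archived.as_ptr().cast::<FieldElement>(), archived.len())
    }
}
```

The returned slice points at the same memory as the input, which is what lets a verifier read values straight out of the archive buffer.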

Numbers (recursion verifier guest, inner empty proof)

| | postcard (base) | rkyv (this PR) | Δ |
|---|---|---|---|
| 1 query, blowup=2 | 115,264,038 cycles | 89,721,844 | −22.2% |
| multi-query, blowup=8 | 2,975,750,844 cycles | 2,210,366,539 | −25.7% |
| decode/setup step | 21.89M / 700.8M cycles | 171 | eliminated |

Design

  • One wire format, one verifier. Owned multi_verify is now a serialize-then-delegate shim, so every host verification (incl. all tests) exercises the actual archive format. PI is materialized per table only where the AIR API needs it (() for the VM — free).
  • Recursion blob: 16-byte LVMR magic/version prefix + rkyv archive, encoded copy-free into an AlignedVec seeded with the prefix. Guest borrows the mapped input region (get_private_input_slice), validates the prefix, verifies in place, commits vk_digest ‖ public_output.
  • Private-input ABI (breaking): payload moves to PRIVATE_INPUT_START + 16 (header = len: u32 + 12 reserved), so the payload base is 16-aligned and structured input needs no pad arithmetic. Rebuild guest ELFs.
  • vkey digest is now a framework-free fixed-width canonical encoding (exhaustive destructure; injective) instead of hashed postcard bytes — digest values change.
  • CLI proof files switch bincode → validated rkyv (format break).
  • ethrex quarantine: ethrex pins rkyv unaligned, a global archived-layout switch that feature-unifies across a workspace build and would silently flip the proof format. Its host-reference tests move to the detached workspace tooling/ethrex-tests (make test-ethrex), and our crates pin aligned so any reintroduction is a hard compile error (verified: re-adding the dep trips rkyv's mutual-exclusion guard).
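As a rough sketch of the recursion-blob framing from the list above: the 16-byte size and the `LVMR` magic come from the description, while the field layout (4-byte magic, `u32` little-endian version, 8 reserved bytes), the version value, the error type and the function names are assumptions made for illustration.

```rust
pub const PREFIX_LEN: usize = 16;
pub const MAGIC: [u8; 4] = *b"LVMR";
/// Hypothetical version number; the real value is not stated in the PR text.
pub const VERSION: u32 = 1;

/// 16-aligned byte buffer, used here to mimic an aligned input region.
#[repr(align(16))]
pub struct Aligned16<const N: usize>(pub [u8; N]);

#[derive(Debug, PartialEq, Eq)]
pub enum BlobError {
    TooShort,
    BadMagic,
    BadVersion(u32),
    Misaligned,
}

/// Build the prefix: magic, version (assumed u32 LE), then 8 zeroed reserved bytes.
pub fn encode_prefix() -> [u8; PREFIX_LEN] {
    let mut prefix = [0u8; PREFIX_LEN];
    prefix[..4].copy_from_slice(&MAGIC);
    prefix[4..8].copy_from_slice(&VERSION.to_le_bytes());
    prefix
}

/// Validate the prefix and borrow the archive bytes that follow it.
/// Nothing is copied; the returned slice aliases the input.
pub fn split_blob(blob: &[u8]) -> Result<&[u8], BlobError> {
    if blob.len() < PREFIX_LEN {
        return Err(BlobError::TooShort);
    }
    let (prefix, archive) = blob.split_at(PREFIX_LEN);
    if prefix[..4] != MAGIC {
        return Err(BlobError::BadMagic);
    }
    let version = u32::from_le_bytes([prefix[4], prefix[5], prefix[6], prefix[7]]);
    if version != VERSION {
        return Err(BlobError::BadVersion(version));
    }
    // In-place reads need the archive to start on a 16-byte boundary.
    if archive.as_ptr() as usize % 16 != 0 {
        return Err(BlobError::Misaligned);
    }
    Ok(archive)
}
```

Because the prefix length is a multiple of the alignment, a 16-aligned blob base yields a 16-aligned archive with no padding arithmetic. In this sketch a misaligned buffer is simply rejected, whereas the PR describes a one-copy fallback on the host.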

Verification

Adversarially reviewed for soundness, robustness, and performance: verifier rewrite is check-for-check identical to the owned path (transcript order, iteration bounds, Montgomery-form transmute analyzed — non-canonical bit patterns can only reject); hardened for in-place reads (OOD dimension/height gates, deep-openings count guard, FRI decommitment length equality — the latter also closes a pre-existing skip of the fri_last_value check); malformed guest blobs are self-DoS only, no false-accept path. Full workspace suite green (492 prover / 137 stark tests; every host verify round-trips the archive), host blob roundtrip/tamper/misalignment tests green, in-VM end-to-end verifies and commits.

Oppen and others added 30 commits July 2, 2026 15:06
Add the measurement/profiling harness for the in-VM STARK verifier:

- `empty`-proof and `deserialize-only` bench guests + `sp1/verifier`
  cross-prover comparison, all exercising the no_std verifier.
- Expand the recursion smoke test with PC-histogram, sampled-flamegraph,
  page-count, cycle-count and per-step-breakdown diagnostics, plus the
  `make test-profile-recursion` targets and the histogram-aggregation
  CI script/workflow.
- Expose read-only `Executor::memory()`, `Memory::cells()` and
  `SymbolTable::functions()` accessors and make `flamegraph::demangle`
  public so the diagnostics can resolve guest PCs to functions.
The top-100 per-address table carried bare PCs with no file:line, so it was
not actionable for optimization and the CI aggregator already discarded it.
Keep the per-function fold (the view that matters); terminate the aggregator's
function-table parse on the trailing rule instead of the removed PC header.
Extract setup_guest_run (blob build + ELF load + Executor::new) and a
log_progress throttled-readout factory, used by the cycle-count, page-count,
PC-histogram, sampled-flamegraph and step-breakdown diagnostics. Generalize
the PC-histogram runner over guest name + progress stride so the
deserialize-only histogram is a one-line caller instead of a near-duplicate.
Collapse the cycle-count, PC-histogram and step-breakdown diagnostics into one
parameterized run_profile(guest, stride, opts, detailed): total cycles print
unconditionally, the top-25 functions + per-step breakdown gate on detailed
(they share one streamed pass over the same PC stream). Every variant now comes
in 1query and multiquery flavours for both recursion and the deserialize-only
control. Route execute_outer_and_commit through drive_executor too — the rebased
streaming finish() makes its hand-rolled drain loop redundant.
Add deserialize-only to RECURSION_GUESTS and migrate the guest to the recursion
guest's std shape (lambda_vm_syscalls + build-std std), since the old no_std
panic handler collided with std. Add getrandom_backend="custom" to its cargo
config (transitive getrandom 0.3 needs it) and track its Cargo.lock. The deser
control guest now builds and its profile tests run.
The smoke pipelines already host-verify the inner proof, so building with
--features stark/instruments surfaces the per-step timings; the dedicated test
was just that verify minus the guest run. Documented the flag in the module doc.
It was never wired into the bench harness or CI (run.sh uses sp1/fibonacci),
and its in-VM verifier-cost comparison is superseded by the recursion profile
tests in this PR.
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
For some reason that alone appears to completely inhibit the effects of
the pre-built commitments and vkey optimization in the next PRs.
Step detection was misbehaving because inlined functions do not emit
symbols. One option was marking them `#[inline(never)]`, which in
the case of `replay_rounds_after_round_1` inhibits optimizations.

Since we added the dependency, we took more advantage of it and expanded
the detailed profile with a per file:line breakdown as well.
Replace symbol/DWARF-based verifier-step detection with an explicit
addi x0,x0,N marker instruction, immune to inlining. Adds a
STEP_DECODE_DONE marker in the recursion guest itself, making the
deserialize-only control guest (manual A/B subtraction) redundant.
Too much reviewer overhead for their current value. The sampled
flamegraph will come back once the executor's flamegraph tooling makes
it simple to reimplement; the page-count histogram isn't interesting
right now.
SymbolTable::functions() and Memory::cells() existed solely for the
flamegraph/page-count smoke tests just deleted.
Add STEP_AIRS_AND_BUS_BALANCE_DONE marker so the verifier's preprocessed
FFT+Merkle commitment build (VmAirs::new) is bucketed separately from
postcard decode and from multi_verify's transcript replay. The top-25
cycle table now also tags each row with its verifier step, so e.g. how
much of step4:openings is keccak is visible at a glance. Update the CI
histogram aggregator to parse and render the new step column.
The previous split added a step column to one combined top-25 table,
losing per-step rank/cum% fidelity. Print the global top-25 (all steps
folded together) plus a separate top-25 table per verifier step, so
each step's own hottest functions and their cumulative share are
visible directly. Update the CI aggregator to parse and render the new
multi-table output.
Per-step tables previously used the global cycle count as the pct
denominator, so a function dominating a cheap step (e.g. 90% of
step2:claimed) rendered as a near-zero percentage of the whole run —
useless for spotting what dominates within that step. Use each step's
own cycle total as the denominator for its table instead; the global
table still uses the run's total. Update the CI aggregator to parse
and surface the per-step denominator.
…filing

decode_step_marker required only dst==0, matching the canonical NOP
(addi x0, x0, 0) as marker 0; pin src==0 and imm!=0 per the documented
addi x0, x0, N convention.

run_profile latched the step bucket at the highest marker ever seen,
so multi_verify's per-AIR-table 3,4,5,6 repetition folded every table
after the first into bucket 6. Track the latest marker instead.
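The two fixes above can be illustrated with a small sketch. The bit layout is the standard RISC-V I-type encoding. The function signatures, the `encode_addi` helper and the bucketing routine are hypothetical and only mirror the behaviour the commit message describes.

```rust
/// Decode `addi x0, x0, N` with N != 0 as step marker N.
/// The canonical NOP (`addi x0, x0, 0`) is not a marker.
pub fn decode_step_marker(inst: u32) -> Option<i32> {
    let opcode = inst & 0x7f; // bits 6:0
    let rd = (inst >> 7) & 0x1f; // bits 11:7
    let funct3 = (inst >> 12) & 0x7; // bits 14:12
    let rs1 = (inst >> 15) & 0x1f; // bits 19:15
    let imm = (inst as i32) >> 20; // bits 31:20, sign-extended
    let is_addi = opcode == 0x13 && funct3 == 0;
    if is_addi && rd == 0 && rs1 == 0 && imm != 0 {
        Some(imm)
    } else {
        None
    }
}

/// Hypothetical helper: assemble `addi rd, rs1, imm` for tests.
pub fn encode_addi(rd: u32, rs1: u32, imm: i32) -> u32 {
    (((imm as u32) & 0xfff) << 20) | ((rs1 & 0x1f) << 15) | ((rd & 0x1f) << 7) | 0x13
}

/// Count instructions per step, attributing each one to the latest marker
/// seen rather than the highest, so repeated 3,4,5,6 sequences do not all
/// collapse into the last bucket.
pub fn cycles_per_step(trace: &[u32], steps: usize) -> Vec<u64> {
    let mut buckets = vec![0u64; steps];
    let mut current = 0usize;
    for &inst in trace {
        if let Some(n) = decode_step_marker(inst) {
            // Ignore markers outside the known step range.
            if n > 0 && (n as usize) < steps {
                current = n as usize;
            }
        }
        buckets[current] += 1;
    }
    buckets
}
```

With a trace that emits marker 3, then 4, then 3 again, the instructions after the second marker 3 are counted under step 3. A highest-marker latch would have counted them under step 4.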
Add `VmVerifyingKey` (prover/src/vkey.rs): host-derived cache of the five
preprocessed-table Merkle commitments (BITWISE, DECODE, REGISTER,
KECCAK_RC, per-PAGE). `VmAirs::new_with_vkey` /
`verify_with_options_with_vkey` take the cached commitments instead of
recomputing them — recomputation is ~87% of verifier cycles inside the
recursion guest. Soundness is preserved by Fiat-Shamir.

The recursion and deserialize-only guests and the smoke test now encode
the vkey into the postcard blob `(VmProof, elf, opts, vkey)`.
Oppen added 6 commits July 2, 2026 17:26
A prover-supplied vkey defeats the preprocessed-commitment check:
Fiat-Shamir only catches post-hoc tampering, not a coordinated
prover committing to a forged table with a matching vkey. Bind
keccak(vkey) as vk_digest: embed ProofOptions in the vkey (query
count and grinding factor pin no commitment), stamp the digest into
VmProof and the statement transcript (V3), check it in verify
before STARK work, and commit vk_digest || inner output from the
recursion guest so the outer verifier can compare against a digest
derived from the trusted inner ELF.

Also validate vkey version/page-count instead of panicking on
short pages, and reject on options mismatch.
The comment should still say when it needs bumping, not why it was bumped the last time.
The host roundtrip test still decoded (VmProof, elf, opts); postcard
discards trailing bytes, so it silently skipped the vkey and the
vkey verify path the guest actually exercises.
Replace postcard and serde with rkyv as the sole proof serialization.
The STARK verifier gets one implementation over ArchivedStarkProof,
reading the archive in place: archived field elements are bit-identical
to native ones on little-endian (ArchivedFieldElement transparent
newtype + slice_as_native), so the recursion guest verifies straight
from its input buffer with no deserialization pass and no per-field
allocation. Owned multi_verify becomes a serialize-then-delegate shim,
so every host verification also exercises the wire format.

- recursion blob: 12-byte LVMR magic/version prefix 16-aligns the
  archive at PRIVATE_INPUT_START+4+12 (const-asserted); guest borrows
  the input region (get_private_input_slice), commits
  vk_digest ‖ public_output from verify_recursion_blob
- vkey digest: framework-free fixed-width canonical encoding
  (exhaustive destructure; injective), replacing postcard bytes
- CLI persistence: bincode -> validated rkyv from_bytes/to_bytes;
  proof files change format
- Table: manual rkyv impl under disk-spill reads via row_major_data,
  archive layout identical to the derive
- ethrex host-reference tests move to tooling/ethrex-tests (detached
  workspace): ethrex pins rkyv 'unaligned', a global archived-layout
  switch that must not feature-unify with the aligned proof format;
  our crates pin 'aligned' so any reintroduction is a compile error
- harden verifier for in-place reads: OOD dimensions_consistent +
  height>0 gate, deep-openings count guard, aux-width checked_sub,
  FRI decommitment length equality (fixes pre-existing skip of the
  fri_last_value check via over-long layers_evaluations_sym)
- verify_recursion_blob falls back to one aligned copy when the host
  buffer is misaligned (guest path stays zero-copy)

Recursion verifier guest, inner empty proof:
- 1 query (blowup=2): 115.26M -> 88.98M cycles (-22.8%)
- multi-query (blowup=8): 2.976B -> 2.211B cycles (-25.7%)
setup step (was postcard decode): 21.89M -> ~170 cycles
Widen the private-input header from 4 to 16 bytes ([len: u32 LE] +
12 reserved), moving the payload base to PRIVATE_INPUT_START + 16 —
16-aligned, so guests can read structured (rkyv-archived) input in
place with naturally-aligned loads instead of working around a
4-aligned base (the recursion blob's pad arithmetic; ethrex's rkyv
'unaligned' pin exists for the same reason and can eventually be
dropped upstream).

- executor: PRIVATE_INPUT_PAYLOAD_OFFSET = 16 (const-asserted
  16-aligned); store_private_inputs writes payload at +16
- syscalls: get_private_input(_slice) and ef_io read at +16
- prover: private_input_bytes mirrors the header in the PAGE/genesis
  image; verifier page bound uses the offset
- recursion prefix: 16 bytes (magic+version+reserved(8)) — pure
  framing now, sized to a multiple of the alignment;
  encode_recursion_input serializes directly after the prefix into an
  AlignedVec (no archive copy), so the host path is aligned by
  construction and the misaligned fallback only guards foreign buffers
- fixtures: fibonacci bench guest and test_private_input_xpage read
  the payload at +16; xpage now commits payload[0..8]

BREAKING: guests built against the +4 payload base read garbage;
rebuild all guest ELFs (make compile-programs).
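A host-side sketch of the widened header, assuming the layout stated above (`len: u32` little-endian, 12 reserved bytes, payload at offset 16). The constant name comes from the commit message. The two functions are hypothetical helpers operating on a plain byte buffer, not the executor's or the syscall crate's actual API.

```rust
/// Payload offset within the private-input region (from the commit message).
pub const PRIVATE_INPUT_PAYLOAD_OFFSET: usize = 16;

// Compile-time check that the payload base stays 16-aligned relative to the
// region start.
const _: () = assert!(PRIVATE_INPUT_PAYLOAD_OFFSET % 16 == 0);

/// Hypothetical helper: lay out `[len: u32 LE][12 reserved zero bytes][payload]`.
pub fn build_input_region(payload: &[u8]) -> Vec<u8> {
    let len = u32::try_from(payload.len()).expect("payload longer than u32::MAX bytes");
    let mut region = vec![0u8; PRIVATE_INPUT_PAYLOAD_OFFSET];
    region[..4].copy_from_slice(&len.to_le_bytes());
    region.extend_from_slice(payload);
    region
}

/// Hypothetical helper: borrow the payload out of a region, or `None` if the
/// header is truncated or the declared length overruns the buffer.
pub fn payload_slice(region: &[u8]) -> Option<&[u8]> {
    let header = region.get(..PRIVATE_INPUT_PAYLOAD_OFFSET)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = PRIVATE_INPUT_PAYLOAD_OFFSET.checked_add(len)?;
    region.get(PRIVATE_INPUT_PAYLOAD_OFFSET..end)
}
```

A reader built for the old layout would look for the payload at offset 4 and land in the reserved bytes, which is why the change requires rebuilding guest ELFs.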
github-actions Bot commented Jul 3, 2026

Benchmark Results for modified programs 🚀

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| head ecsm | 3.2 ± 0.1 | 3.1 | 3.5 | 1.00 |
| head hashmap | 138.0 ± 2.8 | 133.4 | 141.1 | 1.00 |
| head keccak | 130.6 ± 3.5 | 126.2 | 137.2 | 1.00 |
| head syscall_commit | 91.8 ± 0.8 | 90.7 | 93.2 | 1.00 |

Comment thread executor/tests/rust.rs

Collaborator Author

Moved out of the main workspace due to rkyv/unaligned feature bubbling up to lambda_vm. Not actually deleted.

@Oppen Oppen force-pushed the pr/vkey branch 2 times, most recently from a5515d1 to 65f990f Compare July 3, 2026 21:09